6 research outputs found
Informed algorithms for sound source separation in enclosed reverberant environments
While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis thus aims at improving the performance of audio separation algorithms when they are informed, i.e. have access to source location information. These locations are assumed to be known a priori in this work, for example through video processing.
Initially, a multi-microphone array based method combined with binary time-frequency masking is proposed. A robust least squares frequency invariant data independent beamformer designed with the location information is utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking based post-processing is used, but cepstral domain smoothing is required to mitigate musical noise.
To tackle the under-determined case and further improve separation performance at higher reverberation times, a two-microphone based method which is inspired by human auditory processing and generates soft time-frequency masks is described. In this approach interaural level difference, interaural phase difference and mixing vectors are probabilistically modeled in the time-frequency domain and the model parameters are learned through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, which is used as the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic model framework that encodes the spatial characteristics of the enclosure and further improves the separation performance in challenging scenarios, i.e. when sources are in close proximity and when the level of reverberation is high.
Finally, new dereverberation based pre-processing is proposed based on the cascade of three dereverberation stages where each enhances the two-microphone reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation based pre-processing and use of soft mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed for example from speech signals from the TIMIT database and measured room impulse responses.
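As a rough illustration of the binary time-frequency masking step mentioned above, the following minimal sketch assigns each time-frequency point to whichever of two source estimates is stronger there. The function names `binary_masks` and `apply_mask`, and the nested-list layout of the magnitudes, are assumptions made for illustration only; the cepstral smoothing described in the abstract is not shown.

```python
def binary_masks(mag_a, mag_b):
    """Return binary masks for two source estimates.

    mag_a, mag_b: lists of frames, each frame a list of non-negative
    time-frequency magnitudes (hypothetical layout, e.g. beamformer outputs).
    A point is assigned to estimate A when A is at least as strong as B.
    """
    mask_a = [[1.0 if a >= b else 0.0 for a, b in zip(frame_a, frame_b)]
              for frame_a, frame_b in zip(mag_a, mag_b)]
    # The two masks are complementary: every point belongs to exactly one source.
    mask_b = [[1.0 - m for m in frame] for frame in mask_a]
    return mask_a, mask_b


def apply_mask(mask, spec):
    """Multiply a spectrogram by a mask, point by point."""
    return [[m * x for m, x in zip(frame_m, frame_x)]
            for frame_m, frame_x in zip(mask, spec)]
```

A soft mask would replace the hard 0/1 decision with a value between 0 and 1, which is the direction the two-microphone method in the abstract takes.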
Speech separation with dereverberation-based pre-processing incorporating visual cues
Humans are skilled in selectively extracting a single sound
source in the presence of multiple simultaneous sounds. They
(individuals with normal hearing) can also robustly adapt to
changing acoustic environments with great ease. A need has
arisen to incorporate such abilities in machines, which would
enable multiple application areas such as human-computer
interaction, automatic speech recognition, hearing aids and
hands-free telephony. This work addresses the problem of
separating multiple speech sources in realistic reverberant
rooms using two microphones.
Different monaural and binaural cues have previously
been modeled in order to enable separation. Binaural spatial
cues, i.e. the interaural level difference (ILD) and the
interaural phase difference (IPD), have been modeled [1] in the
time-frequency (TF) domain; they exploit the differences in
the intensity and the phase of the mixture signals (because of
the different spatial locations) observed by two microphones
(or ears). The method performs well with little or no
reverberation, but as the amount of reverberation increases and the
sources approach each other, the binaural cues are distorted
and the interaural cues become indistinct, hence degrading
the separation performance. Thus, there is a demand for
exploiting additional cues, and further signal processing is
required at higher levels of reverberation.
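To make the two binaural cues concrete, the sketch below computes an ILD (in dB) and an IPD (in radians) for a single time-frequency point from the complex spectral values of the left and right channels. The function name `interaural_cues`, the left-relative-to-right sign convention and the small constant `eps` are illustrative assumptions, not details taken from the abstract.

```python
import cmath
import math


def interaural_cues(left, right, eps=1e-12):
    """ILD (dB) and IPD (radians) for one time-frequency point.

    left, right: complex spectral values of the two channels.
    eps guards against division by zero and log of zero (assumed value).
    """
    # Level difference: ratio of magnitudes on a decibel scale.
    ild_db = 20.0 * math.log10((abs(left) + eps) / (abs(right) + eps))
    # Phase difference: angle of left times the conjugate of right,
    # which is wrapped to (-pi, pi].
    ipd = cmath.phase(left * right.conjugate())
    return ild_db, ipd
```

Because the IPD is wrapped, it becomes ambiguous at higher frequencies, and reverberation perturbs both cues, which is consistent with the degradation described above.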
A new cascaded spectral subtraction approach for binaural speech dereverberation and its application in source separation
In this work we propose a new binaural spectral subtraction
method for the suppression of late reverberation. The
proposed approach is a cascade of three stages. The first two
stages exploit distinct observations to model and suppress the
late reverberation by deriving a gain function. The musical
noise artifacts generated due to the processing at each stage
are compensated by smoothing the spectral magnitudes of the
weighting gains. The third stage linearly combines the gains
obtained from the first two stages and further enhances the
binaural signals. The binaural gains, obtained by independently
processing the left and right channel signals, are combined
using a new method. Experiments on real data are performed
in two contexts: dereverberation-only and joint
dereverberation and source separation. Objective results verify
the suitability of the proposed cascaded approach in both
contexts.
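The general shape of such a pipeline can be sketched as follows, under stated assumptions: a per-bin amplitude subtraction gain with a spectral floor, a moving-average smoother across frequency as a stand-in for the gain smoothing, and a weighted sum as the linear combination. The function names, the floor of 0.1, the window width and the weight `alpha` are all hypothetical, and the late-reverberation magnitude estimate is taken as given, since the abstract does not say how it is obtained.

```python
def subtraction_gain(mag, late_mag, floor=0.1):
    """Amplitude spectral subtraction gain per frequency bin.

    mag: magnitudes of the reverberant signal in one frame.
    late_mag: estimated late-reverberation magnitudes (assumed to be given).
    floor: lower bound on the gain (assumed value).
    """
    return [max(1.0 - l / m, floor) if m > 0.0 else floor
            for m, l in zip(mag, late_mag)]


def smooth(gains, width=3):
    """Moving average across neighbouring frequency bins (assumed smoother)."""
    half = width // 2
    out = []
    for k in range(len(gains)):
        window = gains[max(0, k - half):k + half + 1]
        out.append(sum(window) / len(window))
    return out


def combine(g1, g2, alpha=0.5):
    """Linear combination of two gain vectors with weight alpha on g1."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
```

The enhanced frame would then be the reverberant magnitudes multiplied by the final gains, with the original phase retained.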
An unsupervised acoustic fall detection system using source separation for sound interference suppression
We present a novel unsupervised fall detection system that employs the collected acoustic signals (footstep sound signals) from an elderly person's normal activities to construct a data description model to distinguish falls from non-falls. The measured acoustic signals are initially processed with a source separation (SS) technique to remove the possible interferences from other background sound sources. Mel-frequency cepstral coefficient (MFCC) features are next extracted from the processed signals and used to construct a data description model based on a one class support vector machine (OCSVM) method, which is finally applied to distinguish fall from non-fall sounds. Experiments on a recorded dataset confirm that our proposed fall detection system can achieve better performance, especially with a high level of interference from other sound sources, as compared with existing single microphone based methods.
Convolutive speech separation by combining probabilistic models employing the interaural spatial cues and properties of the room assisted by vision
In this paper a new combination of the model of the
interaural spatial cues and a model that utilizes spatial properties
of the sources is proposed to enhance speech separation in
reverberant environments. The algorithm exploits the knowledge
of the locations of the speech sources estimated through vision.
The interaural phase difference, the interaural level difference
and the contribution of each source to all mixture channels are
each modeled as Gaussian distributions in the time-frequency
domain and evaluated at individual time-frequency points. An
expectation-maximization (EM) algorithm is employed to refine
the estimates of the parameters of the models. The algorithm outputs
enhanced time-frequency masks that are used to reconstruct
individual speech sources. Experimental results confirm that the
combined video-assisted method is promising for separating sources
in real reverberant rooms.
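The idea of refining Gaussian model parameters with EM and reading off soft masks can be illustrated with a deliberately simplified sketch: a one-dimensional Gaussian mixture fitted to a single scalar cue per time-frequency point, with one component per source. This is not the combined model of the paper; the function names, the scalar cue, the initial variances and the variance floor are assumptions, and the initial means stand in for values that would be derived from the known source locations.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_soft_masks(cues, init_means, n_iter=30, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture to per-point cues with EM.

    Returns (posteriors, means): posteriors[i][j] is the probability that
    point i belongs to source j, which serves as a soft mask value.
    """
    n_src = len(init_means)
    means = list(init_means)
    variances = [1.0] * n_src
    weights = [1.0 / n_src] * n_src
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of each source at each point.
        post = []
        for x in cues:
            lik = [w * gaussian(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
            total = sum(lik) or 1e-300
            post.append([l / total for l in lik])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(n_src):
            nj = sum(p[j] for p in post) or 1e-300
            means[j] = sum(p[j] * x for p, x in zip(post, cues)) / nj
            var = sum(p[j] * (x - means[j]) ** 2 for p, x in zip(post, cues)) / nj
            variances[j] = max(var, var_floor)
            weights[j] = nj / len(cues)
    return post, means
```

Each row of the returned posteriors sums to one, so multiplying the mixture spectrogram by column `j` gives a soft-masked estimate of source `j`.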
Video-aided model-based source separation in real reverberant rooms
Source separation algorithms that utilize only audio
data can perform poorly if multiple sources or reverberation
are present. In this paper we therefore propose a video-aided
model-based source separation algorithm for a two-channel
reverberant recording in which the sources are assumed static.
By exploiting cues from video, we first localize individual speech
sources in the enclosure and then estimate their directions.
The interaural spatial cues, the interaural phase difference and
the interaural level difference, as well as the mixing vectors
are probabilistically modeled. The models make use of the
source direction information and are evaluated at discrete time-frequency
points. The model parameters are refined with the well-known
expectation-maximization (EM) algorithm. The algorithm
outputs time-frequency masks that are used to reconstruct the
individual sources. Simulation results show that by utilizing the
visual modality the proposed algorithm can produce better time-frequency
masks, thereby giving improved source estimates. We
provide experimental results to test the proposed algorithm in
different scenarios and provide comparisons with both other
audio-only and audio-visual algorithms and achieve improved
performance both on synthetic and real data. We also include
dereverberation based pre-processing in our algorithm in order
to suppress the late reverberant components from the observed
stereo mixture and further enhance the overall output of the algorithm.
This advantage makes our algorithm a suitable candidate
for use in under-determined highly reverberant settings where
the performance of other audio-only and audio-visual methods
is limited.
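One simple way a source direction can feed such a model is through a direction vector for the two microphones. The sketch below uses a far-field, free-field approximation, which is an assumption made here for illustration and not necessarily the formulation used in the paper; the function name `direction_vector`, the speed of sound of 343 m/s and the unit-norm scaling are likewise hypothetical choices.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed value)


def direction_vector(azimuth_rad, freq_hz, mic_spacing_m):
    """Unit-norm two-channel direction vector under a far-field assumption.

    azimuth_rad: source direction relative to broadside of the microphone pair.
    freq_hz: centre frequency of the bin.
    mic_spacing_m: distance between the two microphones.
    """
    # Extra travel time to the second microphone for a plane wave.
    delay = mic_spacing_m * math.sin(azimuth_rad) / SPEED_OF_SOUND
    phase = -2.0 * math.pi * freq_hz * delay
    scale = 1.0 / math.sqrt(2.0)
    return (complex(scale, 0.0), scale * cmath.exp(1j * phase))
```

For a source at broadside (azimuth 0) both entries are equal, and the phase of the second entry grows with frequency and with the angle away from broadside.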